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Universal  Sparse  Modeling 

Ignacio  Ramirez  and  Guillermo  Sapiro 
Department  of  Electrical  and  Computer  Engineering 
University  of  Minnesota 
{ramir048,guille}  @  umn.edu 


Abstract 

Sparse  data  models,  where  data  is  assumed  to  be  well  represented  as  a  linear  combination  of  a  few 
elements  from  a  dictionary,  have  gained  considerable  attention  in  recent  years,  and  their  use  has  led  to 
state-of-the-art  results  in  many  signal  and  image  processing  tasks.  It  is  now  well  understood  that  the 
choice  of  the  sparsity  regularization  term  is  critical  in  the  success  of  such  models.  In  this  work,  we 
use  tools  from  information  theory,  and  in  particular  universal  coding  theory,  to  propose  a  framework 
for  designing  sparsity  regularization  terms  which  have  several  theoretical  and  practical  advantages  when 
compared  to  the  more  standard  £q  or  £\  ones,  and  which  lead  to  improved  coding  performance  and 
accuracy  in  reconstruction  and  classification  tasks.  We  also  report  on  further  improvements  obtained 
by  imposing  low  mutual  coherence  and  Gram  matrix  norm  on  the  corresponding  learned  dictionaries. 

The  presentation  of  the  framework  and  theoretical  foundations  is  complemented  with  examples  in  image 
denoising  and  classification. 

EDICS:  MLR-INFO,  SSP-APPL,  SSP-SNMD,  SSP-SSAN. 

I.  Introduction 

Sparse  modeling  calls  for  constructing  a  succinct  representation  of  some  data  as  a  combination  of 
a  few  typical  patterns  {atoms)  learned  from  the  data  itself.  Significant  contributions  to  the  theory  and 
practice  of  learning  such  collections  of  atoms  (usually  called  dictionaries  or  codebooks),  e.g.,  [1],  [14], 

[31] ,  and  of  representing  the  actual  data  in  terms  of  them,  e.g.,  [9],  [11],  [12],  have  been  developed  in 
recent  years,  leading  to  state-of-the-art  results  in  many  signal  and  image  processing  tasks  [24],  [26],  [27], 

[32] ,  We  refer  the  reader  for  example  to  [5]  for  a  recent  review  on  the  subject. 

A  critical  component  of  sparse  modeling  is  the  actual  sparsity  of  the  representation,  which  is  controlled 
by  a  regularization  term  {regularizer  for  short)  and  its  associated  parameters.  The  choice  of  the  functional 
form  of  the  regularizer  and  its  parameters  is  a  challenging  task.  Several  solutions  to  this  problem  have 
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been  proposed  in  the  literature,  ranging  from  the  automatic  tuning  of  the  parameters  [20]  to  Bayesian 
models,  where  these  parameters  are  themselves  considered  as  random  variables  [17],  [20],  [46]. 

In  this  paper  we  address  this  challenge  by  proposing  a  family  of  regularizers  that  are  robust  under 
the  choice  of  their  parameters.  These  regularizers  are  derived  using  tools  from  information  theory, 
more  specifically,  from  universal  coding  theory.  The  main  idea  is  to  consider  sparse  modeling  as  a 
codelength  minimization  problem,  where  the  regularizers  define,  through  a  probability  assignment  model, 
the  codelength  associated  with  the  description  of  the  sparse  representation  coefficients.  We  then  design 
these  regularizers  so  that  the  associated  codelength  for  describing  the  coefficients  generating  any  possible 
signal  (from  a  given  class  of  signals),  is  close  to  the  best  possible  codelength  achievable  for  that  particular 
instance  of  coefficients.  Encoding  schemes  with  this  property,  and  their  associated  regularizers,  are  called 
universal  [29],  where  universality  is  defined  with  respect  to  the  set  of  signals  that  can  be  observed  in 
practice  in  a  given  application.  These  concepts  will  be  formally  introduced  in  the  paper. 

The  universal  regularizers  that  we  obtain  with  our  framework  have  several  desirable  theoretical  and 
practical  properties  such  as  statistical  consistency,  improved  robustness  to  outliers  in  the  data,  and  lead  to 
a  better  sparse  signal  recovery  (e.g.,  decoding  of  sparse  signals  in  compressive  sensing)  than  £q  and  i\- 
based  techniques  in  practice.  This  new  family  of  models  is  complemented  by  imposing  incoherence  in  the 
learned  dictionary.  We  illustrate  with  tasks  from  image  processing  the  power  of  these  new  models.  Finally, 
the  introduction  of  tools  from  universal  modeling  into  the  sparse  world  permits  to  bring  a  fundamental 
and  well  supported  theoretical  angle  to  this  very  important  and  popular  area  of  research. 

The  remainder  of  this  paper  is  organized  as  follows:  in  Section  II  we  introduce  the  standard  framework 
of  sparse  modeling.  Section  III  is  dedicated  to  the  derivation  of  our  proposed  universal  modeling  frame¬ 
work,  while  Section  IV  deals  with  its  implementation.  Section  V  presents  experimental  results  showing 
the  practical  benefits  of  the  proposed  framework  for  image  representation  and  classification.  Concluding 
remarks  are  given  in  Section  VI. 

II.  Sparse  modeling  and  the  need  for  better  models 

Let  X  G  WMxN  be  a  set  of  N  column  data  samples  xj  G  Mm,  D  g  M.MxK  a  dictionary  of  K  atoms 
represented  as  columns  G  Mm,  and  A  G  M.KxN ,  slj  G  M^,  a  set  of  reconstruction  coefficients  such  that 
X  =  D  A.  We  use  a j.  to  denote  the  k-th  row  of  A,  which  corresponds  to  the  coefficients  associated  to  the 
/,:-th  atom  in  D.  For  each  j  =  l,...,JVwe  define  the  active  set  of  a j  as  Aj  =  {k  :  akj  7^  0,1  <  k  <  K}, 
and  1 1  a  j  1 1 0  =  Aj  as  its  cardinality.  The  goal  of  sparse  modeling  is  to  design  a  dictionary  D  such  that  for 
all  or  most  data  samples  xj,  there  exists  a  coefficients  vector  a j  such  that  x;  ~  D  aj  and  ||aj||0  is  small 
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(usually  below  some  threshold  L  <C  K).  Formally,  we  would  like  to  solve  the  following  optimization 
problem  in  (D,  A), 

N 

min  y~^(aj)  s.t.  \\x.j  —  D a.j\\^  <  e,  j  =  l,...,N  (1) 

D,A  3=1 

where  'ip(-)  is  a  regularization  function,  or  regularizer,  which  induces  sparsity  in  the  columns  of  the 
solution  A.  Usually  the  constraint  ||dfc||2  <1,  k  =  1, . . . ,  K ,  is  added,  since  otherwise  we  can  always 
decrease  the  cost  function  arbitrarily  by  multiplying  D  by  a  large  constant  and  dividing  A  by  the  same 
constant.  When  D  is  fixed,  the  problem  of  finding  a  sparse  a j  for  each  sample  x?  is  called  sparse  coding, 

a.j  =  argmin  V’(aj)  s.t.  llxi  ~  E)  a2 j  1 2  <  e.  (2) 

Among  possible  choices  of  'ip(-)  are  the  (f  pseudo-norm,  U(-)  =  |j-||0,  and  the  l\  norm.  The  former 
tries  to  solve  directly  for  the  sparsest  a j,  but  since  it  is  non-convex,  it  is  commonly  replaced  by  the  i\ 
norm,  which  is  its  closest  convex  approximation.  Furthermore,  under  certain  conditions  on  (fixed)  D  and 
the  sparsity  of  a j,  the  solutions  to  the  Iq  and  -based  sparse  coding  problems  coincide  (see  for  example 

[6]).  The  problem  (1)  is  also  usually  formulated  in  Lagrangian  form, 

N 

min  llxj  —  D  II2  +  AV>(a  j),  (3) 

3=1 

along  with  its  respective  sparse  coding  problem  when  D  is  fixed, 

a.j  =  argmin  ||xj  —  D  a||2  +  A^(a)-  (4) 

Even  when  the  regularizer  'ip(-)  is  convex,  the  sparse  modeling  problem,  in  any  of  its  forms,  is  jointly 
non-convex  in  (D,  A).  Therefore,  the  standard  approach  to  find  an  approximate  solution  is  to  use  alternate 
minimization:  starting  with  an  initial  dictionary  D(0\  we  minimize  (3)  alternatively  in  A  via  (2)  or  (4) 
(sparse  coding  step),  and  then  D  (dictionary  update  step).  The  sparse  coding  step  can  be  solved  efficiently 
when  U(-)  is  the  l\  norm,  using  for  example  Iterative  Shrinkage  [11]  or  LARS  [12],  or  with  OMP  [28] 
when  the  regularizer  is  the  £0  pseudo-norm.  The  dictionary  update  step  can  be  done  using  for  example 
MOD  [14]  or  K-SVD  [1], 

A.  Interpretations  of  the  sparse  coding  problem 

We  now  turn  our  attention  to  the  sparse  coding  problem:  given  a  fixed  dictionary  D,  for  each  sample 
vector  Xj,  compute  the  sparsest  vector  of  coefficients  a  j  that  yields  a  good  approximation  of  Xj.  The 
sparse  coding  problem  admits  several  interpretations.  What  follows  is  a  summary  of  these  interpretations 
and  the  insights  that  they  provide  into  the  properties  of  the  sparse  models. 
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1)  Model  selection  in  statistics:  Using  the  norm  as  ?/>(•)  in  (4)  is  known  in  the  statistics  community 
as  the  Akaike’s  Information  Criterion  (AIC)  when  A  =  1,  or  the  Bayes  Information  Criterion  (BIC) 
when  A  =  ^  log  M,  two  popular  forms  of  model  selection  (see  [  12,  Chapter  7]).  In  this  context,  the 
t\  regularizer  was  introduced  in  [38],  again  as  a  convex  approximation  of  the  above  model  selection 
methods,  and  is  commonly  known  (either  in  its  constrained  or  Lagrangian  forms)  as  the  Lasso.  Note 
however  that,  in  the  regression  interpretation  of  (4),  the  interpretation  of  D  and  X  is  very  different. 

2)  Compressive  sensing:  The  sparse  coding  problem  also  appears,  usually  in  its  constrained  form  (2), 
at  the  core  of  compressive  sensing  [6],  In  this  case,  the  vectors  Xj  are  often  random  projections  onto  the 
rows  of  the  sensing  matrix  D  of  the  unknown  sparse  data  to  be  recovered  a j  (note  the  very  different 
role  of  Xj,  a.j  and  D  in  this  context).  One  of  the  main  results  in  compressive  sensing  states  that,  under 
certain  conditions  on  D  and  X,  one  can  obtain  the  exact  solution  to  the  sparse  coding  problem  by 
solving  the  () -based  problem  instead,  which  can  be  done  using  standard  optimization  techniques  [6]. 

3)  Maximum  a  posteriori:  Another  interpretation  of  (4)  is  that  of  a  maximum  a  posteriori  (MAP) 
estimation  of  a j  in  the  logarithmic  scale,  that  is 

aj  =  argrnax  (logP(a|xj)}  =  argmax  (logP(xj|a)  +  log.P(a)} 

=  argmin  {—  logP(x,ja)  —  logP(a)},  (5) 

a  J 

where  the  observed  samples  Xj  are  assumed  to  be  contaminated  with  additive,  zero  mean,  IID  Gaussian 
noise  with  variance  a2, 

P(xj|a)  cx  e“5^|lXi“DaH»,  (6) 

and  a  prior  probability  model  on  a  with  the  form 

P{ a)  oc  e-9^  (7) 

is  considered.  The  energy  term  in  Equation  (4)  follows  by  plugging  (6)  and  (7)  into  (5)  and  factorizing 

2(j2  into  A  =  2a2 0.  According  to  (5)  and  (7)  we  have  that  the  i\  regularizer  corresponds  to  an  IID 
Laplacian  prior  with  mean  0  and  inverse-scale  parameter  0,  T’(a)  =  n['=i  0e~e\ak\  =  0K  e~9^  a  I  ' ,  which 
has  a  special  meaning  in  signal  processing  tasks  such  as  image  or  audio  compression.  This  is  due  to 
the  widely  accepted  fact  that  representation  coefficients  derived  from  predictive  coding  of  continuous¬ 
valued  signals,  and,  more  generally,  responses  from  zero-mean  filters,  are  well  modeled  using  Laplacian 
distributions.  For  example,  for  the  special  case  of  DCT  coefficients  of  image  patches,  an  analytical  study 
of  this  phenomenon  is  provided  in  [25],  along  with  further  references  on  the  subject. 
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4)  Codelength  minimization:  Sparse  coding,  in  all  its  forms,  has  yet  another  important  interpretation. 
Suppose  that  we  have  a  fixed  dictionary  D  and  that  we  want  to  use  it  to  compress  an  image,  either 
losslessly  by  encoding  the  reconstruction  coefficients  A  and  the  residual  X  —  DA,  or  in  a  lossy  manner, 
by  obtaining  a  good  approximation  X  ~  DA  and  encoding  only  A.  Consider  for  example  the  latter  case. 
Most  modern  compression  schemes  consist  of  two  parts:  a  probability  assignment  stage,  where  the  data, 
in  this  case  A,  is  assigned  a  probability  P( A),  and  an  encoding  stage,  where  a  code  C(A)  of  length 
L( A)  bits  is  assigned  to  the  data  given  its  probability,  so  that  L( A)  is  as  short  as  possible.  The  techniques 
known  as  Arithmetic  and  Huffman  coding  provide  the  best  possible  solution  for  the  encoding  step,  which 
is  to  approximate  the  Shannon  ideal  codelength  L( A)  =  —  logP(A)  [10,  Chapter  5].  Therefore,  modern 
compression  theory  deals  with  maximizing  P( A),  or,  equivalently,  minimizing  —  logP(A).  Now,  to 
encode  X  lossily  we  obtain  coefficients  A  such  that  each  data  sample  x.;  is  approximated  up  to  a  certain 
distortion  e,  || —  Da;  II 2  —  e-  Therefore,  given  a  model  P( a)  for  a  vector  of  reconstruction  coefficients, 
and  assuming  that  we  encode  each  sample  independently,  the  optimum  vector  of  coefficients  a j  for  each 
sample  Xj  will  be  the  solution  to  the  optimization  problem 

SLj  =  argmiri  —  logP(a)  s.t.  ||xj  —  D  a.,  ^  <  e,  (8) 

which,  for  the  choice  P( a)  oc  coincides  with  the  error  constrained  sparse  coding  problem  (2). 

Suppose  now  that  we  want  lossless  compression.  In  this  case  we  also  need  to  encode  the  reconstruction 
residual  Xj  —  Da^.  Since  P(x,  a)  =  P(x|a)P(a),  the  combined  codelength  will  be 

=  “log P(xj,aj)  =  -logP(xj|ai)  -  logP(ai).  (9) 

Therefore,  obtaining  the  best  coefficients  a;  amounts  to  solving  mina  L(xj,  &j),  which  is  precisely  the 
MAP  formulation  of  (5),  which  in  turn,  for  proper  choices  of  P(x|a)  and  P( a),  leads  to  the  Lagrangian 
form  of  sparse  coding  (4). 

As  one  can  see,  the  codelength  interpretation  of  sparse  coding  is  able  to  unify  and  interpret  both  the 
constrained  and  unconstrained  formulations  into  one  consistent  framework.  Furthermore,  this  framework 
offers  a  natural  and  objective  measure  for  comparing  the  quality  of  different  models  P(x|a)  and  P(a)  in 
terms  of  the  codelengths  obtained.  The  Laplacian  model  for  A  is  widely  used  in  image  compression,  for 
example  in  the  JPEG-LS  standard,  which  applies  a  Laplacian  model  to  describe  prediction  errors  [43]. 

Before  continuing,  we  need  to  clarify  an  important  technical  detail:  Laplacian  models,  as  well  as 
Gaussian  models,  are  probability  distributions  over  M,  characterized  by  continuous  probability  density 
functions,  f(a)  =  F'(a),  F(a )  =  Fix  <  a).  If  the  reconstruction  coefficients  are  considered  real  numbers, 
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Fig.  1:  Standard  8x8  DCT  dictionary  (left),  global  empirical  distribution  of  the  coefficients  in  A  (center,  log  scale)  and  the 
K  =  64  empirical  distributions  of  the  coefficients  associated  to  each  of  the  K  =  64  DCT  atoms  (right,  log  scale).  The  coefficients 
were  obtained  from  encoding  around  10®  8x8  patches  (after  removing  their  DC  component)  randomly  sampled  from  the  Pascal 
2006  dataset  of  natural  images  [  5],  Note  the  difference  in  variance  as  we  move  from  the  low  frequencies  (upper  left  corner)  to 
the  higher  frequencies  (lower  right  corner).  Note  also  that  even  for  each  single  atom,  the  distribution  of  the  coefficients  is  not 
a  perfect  Laplacian,  but  something  with  heavier  tails. 


under  any  of  these  distributions,  any  instance  of  A  £  ~RKxN  will  have  measure  0,  that  is,  P( A)  =  0. 
In  order  to  use  such  distributions  as  our  models  for  the  data,  we  assume  that  the  coefficients  in  A  are 
quantized  to  a  precision  A,  small  enough  for  the  density  function  /(a)  to  be  approximately  constant  in 
any  interval  [a  —  A/2,  a  +  A/2],  x  £  M,  so  that  we  can  approximate  P{a)  ~  A /(a),  a£t.  Under  these 
assumptions,  —  logP(a)  ~  —  log /(a)  —  log  A,  and  the  effect  of  A  on  the  codelength  produced  by  any 
model  is  the  same.  Therefore,  we  will  omit  A  in  the  sequel,  and  treat  density  functions  and  probability 
distributions  interchangeably  as  P(-).  Of  course,  in  real  compression  applications,  A  needs  to  be  tuned. 

B.  The  need  for  a  better  model 

As  explained  in  the  previous  subsection,  the  use  of  the  t\  regularizer  implies  that  all  the  coefficients  in 
A  share  the  same  Laplacian  parameter  6.  Flowever,  as  noted  in  [25]  and  references  therein,  the  empirical 
variance  of  coefficients  associated  to  different  atoms,  that  is,  of  the  different  rows  aj.  of  A,  varies 
greatly  with  k  =  1  . . .  ,K.  This  is  clearly  seen  in  Figure  1,  which  shows  the  empirical  distribution  of 
DCT  coefficients  of  8x8  patches.  As  the  variance  of  a  Laplacian  is  2 /O'2,  different  variances  indicate 
different  underlying  0.  The  histogram  of  the  set  ,  &  =  1>  •  •  ■  A  }  °f  estimated  Laplacian  parameters 
for  each  row  k,  Figure  2(left),  shows  that  this  is  indeed  the  case,  with  significant  occurrences  of  values 
of  6  in  a  range  of  5  to  25. 

A  first  refinement  is  thus  an  independent  but  not  identically  distributed  Laplacian  model,  where  each 
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Fig.  2:  Left:  Histogram  of  the  K  =  64  different  9 k  values  obtained  by  fitting  a  Laplacian  distribution  to  each  row  of  A 
for  the  data  in  Figure  1.  Note  that  there  are  significant  occurrences  between  9  =  5  to  9  =  25,  indicating  very  different  widths 
of  the  empirical  distributions  for  different  atoms,  as  hinted  from  Figure  1.  Right:  Histograms  showing  the  variability  of  the  best 
local  estimations  of  6k  for  a  few  rows  of  A  across  different  regions  of  an  image.  The  coefficients  A  correspond  to  the  sparse 
encoding  of  all  8x8  patches  from  a  single  image,  in  scan-line  order.  For  each  k,  each  value  of  9k  was  computed  from  a  random 
contiguous  block  of  250  samples  from  ajf .  The  procedure  was  repeated  4000  times  to  obtain  an  empirical  distribution.  In  all 
cases  shown,  the  wide  supports  of  the  empirical  distributions  indicate  that  the  estimated  9  can  have  very  different  values,  even 
for  the  same  atom,  depending  on  the  region  of  the  data  from  where  the  coefficients  are  taken. 


coefficient  a k  associated  to  atom  d/,.  has  a  different  Laplacian  parameter  0k, 

K 

P(a)  =  l[0ke-9^.  (10) 

k=  1 

Plugging  the  model  (10)  in,  for  example,  (4),  results  in  a  weighted  i\  sparse  coding  formulation, 

K 

a j  =  argmin  |xj  —  D  a||^  +  Afc|afc|,  (11) 

a  k= 1 

where  \k  :=  2o26k-  The  weighted  i\  model  (11),  also  called  weighted  Lasso,  has  been  used  for  example 
in  [2],  [36].  Assume  now  that  a *■  is  the  true  underlying  coefficients  vector  generating  x;,  and  that  A0  is 
its  active  set.  From  a  statistics  perspective,  it  has  been  shown  that,  with  a  proper  choice  of  the  weights 
{Afc}fc=i,...,A>  the  weighted  Lasso  is  an  oracle  estimator,  meaning  that  P(Aj  /  A(]j)  — >  0  and  that  a j  — >  a® 
in  probability,  both  limits  taken  when  M  — ►  oo  [46]. 1 

From  a  modeling  perspective,  instead  of  a  single  parameter,  the  weighted  Lasso  has  K  parameters  to 
deal  with.  This  explosion  of  parameters  is  often  undesirable  and,  yet,  it  may  not  be  accurate  enough  if  we 
consider  that  real  images,  or  other  types  of  signals  such  as  audio  samples,  are  far  from  stationary.  This 

'Unfortunately,  for  sparse  modeling  and  compressive  sensing  applications,  M  is  relatively  small  for  the  results  of  [46]  to  be 
meaningful  (the  asymptotic  results  in  [46]  require  that  M  >  K  so  that  DTD  is  positive  definite,  but  we  usually  have  M  <  K). 
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implies  that,  in  reality,  even  if  each  atom  k  is  associated  to  its  own  Of.  (A/J,  the  optimal  value  of  Of.  can 
have  significant  local  variations  at  different  positions  or  times.  This  effect  is  shown  in  Figure  2(right), 
where,  for  each  k,  each  Ok  was  re-estimated  several  times  using  samples  from  different  regions  of  an 
image,  and  the  histogram  of  the  different  estimated  values  of  Ok  was  computed.  Here  again  we  used  the 
DCT  basis  as  the  dictionary  D. 

The  need  for  a  flexible  model  which  at  the  same  time  has  a  small  number  of  parameters  leads  naturally 
to  Bayesian  formulations  where  the  different  possible  A k  are  “marginalized  out”  by  imposing  an  hyper¬ 
prior  distribution  on  A,  sampling  A  using  its  posterior  distribution,  and  then  averaging  the  estimates 
obtained  with  the  sampled  sparse-coding  problems.  Examples  of  this  recent  line  of  work,  and  the  closely 
related  Bayesian  Compressive  Sensing,  are  developed  for  example  in  [23],  [39],  [45],  [44],  Despite  of 
its  promising  results,  the  Bayesian  approach  is  often  criticized  due  to  the  potentially  expensive  sampling 
process  (something  which  can  be  reduced  for  certain  choices  of  the  priors  involved  [23]),  arbitrariness 
in  the  choice  of  the  priors,  and  lack  of  proper  theoretical  justification  for  the  proposed  models  [44], 

In  this  work  we  pursue  the  same  goal  of  deriving  a  more  flexible  and  accurate  sparse  model  than 
the  traditional  ones,  while  avoiding  an  increase  in  the  number  of  parameters  and  the  burden  of  possibly 
solving  several  sampled  instances  of  the  sparse  coding  problem.  For  this,  we  deploy  tools  from  the 
very  successful  information-theoretic  field  of  universal  coding,  which  is  an  extension  of  the  compression 
scenario  summarized  above  in  Section  II-A,  when  the  probability  model  for  the  data  to  be  described  is 
itself  unknown  and  has  to  be  described  as  well. 

The  result  is  a  framework  and  a  corresponding  family  of  models  that  are  as  accurate  as  (11)  (or  even 
better,  as  we  will  see),  while  maintaining  the  computational  cost  of  sparse  coding  at  a  small  multiple  of 
that  required  for  solving  (2)  with  an  ^i-regularizer. 

III.  Universal  models  for  sparse  coding 

Following  the  discussion  in  the  preceding  section,  we  now  have  several  possible  scenarios  to  deal  with. 
First,  we  may  still  want  to  consider  a  single  value  of  0  to  work  well  for  all  the  coefficients  in  A,  and  try 
to  design  a  sparse  coding  scheme  that  does  not  depend  on  prior  knowledge  on  the  value  of  0.  Secondly, 
we  can  consider  an  independent  (but  not  identically  distributed)  Faplacian  model  where  the  underlying 
parameter  0  can  be  different  for  each  atom  d/.,  k  =  1 , ,I\.  In  the  most  extreme  scenario,  we  can 
consider  each  single  coefficient  akj  in  A  to  have  its  own  unknown  underlying  Okj  and  yet,  we  would 
like  to  encode  each  of  these  coefficients  (almost)  as  if  we  knew  its  hidden  parameter. 

The  first  two  scenarios  are  the  ones  which  fit  the  original  puipose  of  universal  coding  theory  [29], 
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which  is  the  design  of  optimal  codes  for  data  whose  probability  models  arc  unknown,  and  the  models 
themselves  are  to  be  encoded  as  well  in  the  compressed  representation.  The  theory  of  universal  coding 
also  plays  a  central  role  in  the  Minimum  Description  Length  principle  (MDL)  [3],  [21],  a  powerful  model 
selection  tool,  of  particular  relevance  in  all  the  applications  presented  here. 

Now  we  develop  the  basic  ideas  and  techniques  of  universal  coding  applied  to  the  first  scenario, 
where  the  problem  is  to  describe  A  as  an  IID  Laplacian  with  unknown  parameter  0.  Assuming  a  known 
parametric  form  for  the  prior,  with  unknown  parameter  0,  leads  to  the  concept  of  a  model  class.  In  our 

case,  we  consider  the  class  M.  =  {P(A|0)  :  9  G  0}  of  all  IID  Laplacian  models  over  A  G  M.KxN ,  where 

N  K 

P{ A|0)  =  I]  II  P(akj\0)  =  ee~9^ 

j=  1  k=  1 

and  0  C  M+.  The  goal,  following  the  universal  coding  framework,  is  to  find  a  probability  model  Q( A) 
which  can  fit  A  as  well  as  the  model  in  M.  that  best  fits  A  after  having  observed  it.  A  model  Q( A) 
with  this  property  is  called  universal  (with  respect  to  the  model  M). 

For  simplicity,  in  the  following  discussion  we  consider  the  coefficient  matrix  A  to  be  arranged  as  a 
single  long  column  vector  of  length  n  =  K xJV,  a  =  (a±, . . .  ,an).  We  also  use  the  letter  a  without 
sub-index  to  denote  the  value  of  a  random  variable  representing  coefficient  values. 

First  we  need  to  define  a  criterion  for  comparing  the  fitting  quality  of  different  models.  In  universal 
coding  theory  this  is  done  in  terms  of  the  codelengths  L( a)  required  by  each  model  to  describe  a. 

If  the  model  consists  of  a  single  probability  distribution  P(-),  we  known  from  Section  II- A4  that 
the  optimum  codelength  corresponds  to  Lp(a)  =  —  logP(a).  Moreover,  this  relationship  defines  a  one- 
to-one  correspondence  between  distributions  and  codelengths,  so  that  for  any  coding  scheme  Lq(sl), 
Q( a)  =  2~Lq^&> .  Now  suppose  that  we  are  restricted  to  a  class  of  models  M,  and  that  we  need  choose 
the  model  P  G  M  that  assigns  the  shortest  codelength  to  a  particular  instance  of  a.  We  then  have  that 
P  is  the  model  in  M.  that  assigns  the  maximum  probability  to  a.  For  a  class  M.  parameterized  by 
6,  this  corresponds  to  P  =  P(a|0(a)),  where  0(a)  is  the  maximum  likelihood  estimator  (MLE)  of  the 
model  class  parameter  6  given  a  (we  will  usually  omit  the  argument  and  just  write  0).  Unfortunately, 
we  also  need  to  include  the  value  of  9  in  the  description  of  a  for  the  decoder  to  be  able  to  reconstruct 
it  from  the  code  C(a).  Thus,  we  have  that  any  model  Q(a)  inducing  valid  codelengths  /.g(a)  will  have 
Lq(sl)  >  —  logP(a|0).  The  overhead  of  Lq(sl)  with  respect  to  —  log  P(&\6)  is  known  as  the  codelength 
regret, 

TZ{a,Q)  :=  Lq{ a)  -  (- log  P(a| 0(a)))  =  -  log  Q(a)  +  log  P(a|0(a))). 
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A  model  Q(a)  (or,  more  precisely,  a  sequence  of  models,  one  for  each  data  length  n)  is  called  universal 
if  P(a,  Q)  grows  sublinearly  in  n  for  all  possible  realizations  of  a,  that  is 

-K(&,Q)  — >  0  ,  VaG  Mn, 
n 

so  that  the  codelength  regret  with  respect  to  the  MLE  becomes  asymptotically  negligible. 

There  are  a  number  of  ways  to  construct  universal  probability  models.  The  simplest  one  is  the  so  called 
two-part  code,  where  the  data  is  described  in  two  parts.  The  first  part  describes  the  optimal  parameter 
0(a)  and  the  second  part  describes  the  data  according  to  the  model  with  the  value  of  the  estimated 
parameter  9,  P(a|0(a)).  For  uncountable  parameter  spaces  0,  such  as  a  compact  subset  of  M,  the  value 
of  9  has  to  be  quantized  in  order  to  be  described  with  a  finite  number  of  bits  d.  We  call  the  quantized 
parameter  9d.  The  regret  for  this  model  is  thus 

P(a,Q)  =  L(9d)  +  L(a\9d)  -  L(a|0)  =  L{9d)  -  logP(a|0d)  -  (-logP(a|0)). 

The  key  for  this  model  to  be  universal  is  in  the  choice  of  the  quantization  step  for  the  parameter  9,  so 
that  both  its  description  L(9d),  and  the  difference  —  log  P(a\f),i)  —  (—  log  P(a|0)),  grow  sublinearly.  This 
can  be  achieved  by  letting  the  quantization  step  shrink  as  0(l/y/n),  requiring  then  d  =  0(0.5  log  n)  bits 
to  describe  9d,  as  detailed  in  the  proof  of  universality  of  two-part  codes  given  in  [21,  Theorem  10.1].  In 
this  case  the  regret  for  the  two-part  codes  grows  as  log  n,  where  dim(0)  is  the  dimension  of  the 

parameter  space  0. 

Another  important  universal  code  is  the  so  called  Normalized  Maximum  Likelihood  (nml)  [37],  In  this 
case  the  universal  model  Q*(a)  corresponds  to  the  model  that  minimizes  the  worst  case  regret, 

Q*  (a)  =  min  maxi—  log  Q (a)  +  log  P(a|0(a))}, 

Q  a 

which  can  be  written  in  closed  form  as  Q*(&)  =  1  cj'm'u)  where  the  normalization  constant  C(A4,  n)  := 
SaeK"  -f*(a|0(*O)da  is  the  value  of  the  minimax  regret  and  depends  only  on  Ai  and  the  length  of 
the  data  n.  The  fact  that  Q*(a)  is  minimax  worst  case  optimal  derives  from  the  fact  that  it  defines  a 
complete  uniquely  decodable  code  for  all  data  a  of  length  n,  that  is,  it  satisfies  the  Kraft  inequality  with 
equality2  X^aeM"  2~Lq*^  =  1.  Since  every  uniquely  decodable  code  with  length  Lq( a)  has  to  satisfy 
the  Kraft  inequality  (see  [10,  Chapter  5]),  if  there  exists  a  value  of  a  such  that  Lq( a)  <  Lq-( a)  (that  is 
2~ tjQ (a)  >  2~ L«*(a)),  then  there  must  exist  at  least  some  a7  for  which  Lq{s!)  >  Lq*( a')  for  the  Kraft 

2Recall  that  we  are  actually  dealing  with  (finely)  quantized  a,  so  that  the  expression  is  really  a  sum  and  not  an  integral. 
Nevertheless,  we  write  a  €  Rn  since  the  distributions  that  we  are  using  are  defined  over  R. 
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inequality  to  hold.  Therefore  the  regret  of  Q  for  a!  is  necessarily  greater  than  C(A4.  n),  which  shows 
that  Q*  is  minimax  optimal.  Note  that  the  NML  model  requires  that  C (A4 ,  n)  be  finite,  something  which 
is  often  not  the  case. 

The  two  previous  examples  are  good  for  assigning  a  probability  to  coefficients  a  that  have  already 
been  computed,  but  they  cannot  be  used  as  a  model  for  computing  the  coefficients  themselves  since  they 
depend  on  having  observed  them  in  the  first  place.  For  this  and  other  reasons  that  will  become  clearer 
later,  we  concentrate  our  work  on  a  third  important  family  of  universal  codes  derived  from  the  so  called 
mixture  models  (also  called  Bayesian  mixtures).  In  a  mixture  model,  Q(a)  is  a  convex  mixture  of  all 
the  models  P(a\9)  in  M,  indexed  by  the  model  parameter  6,  Q(a )  =  fe  P(a\9)w(9)d9,  where  w(0) 
specifies  the  weight  of  each  model.  Being  a  convex  mixture  implies  that  w(8)  >  0  and  /e  w(0)d0  =  1, 
thus  w(6)  is  itself  a  probability  measure  over  0.  We  will  restrict  ourselves  to  the  particular  case  when 
a  is  considered  a  sequence  of  independent  random  variables,3 

n 

Q{  a)  =  J_  |  Qj  ( aj )  i  Qj(aj )  = 

3= 1 

where  the  mixing  function  Wj(9)  can  be  different  for  each  sample  j.  An  important  particular  case 
of  this  scheme  is  the  so  called  Sequential  Bayes  code,  in  which  Wj(0)  is  computed  sequentially  as  a 
posterior  distribution  based  on  previously  observed  samples,  that  is  w3{0)  =  P(9\a^n~1'>)  [21,  Chapter  6]. 
In  this  work,  for  simplicity,  we  restrict  ourselves  to  the  case  where  Wj{9)  =  w(0)  is  the  same  for  all 
j.  The  result  is  an  IID  model  where  the  probability  of  each  sample  aj  is  a  mixture  of  some  probability 
measure  over  M, 

Qj(aj)  =  Q(aj)=  [  P{aj\e)w{6)d6,  Vj  =  (13) 

Je 

A  central  result  in  universal  coding  theory  for  IID  mixture  (Bayesian)  codes  states  that  their  asymptotic 
regret  is  0(  log  n),  thus  stating  their  universality,  as  long  as  the  weighting  function  w(9)  is  positive, 

continuous  and  unimodal  over  0.  This  gives  us  great  flexibility  on  the  choice  of  a  weighting  function 
w(6)  that  guarantees  universality.  Of  course,  the  results  are  asymptotic  and  the  o(logn)  terms  can  be 
large,  so  that  the  choice  of  w(6)  can  have  practical  impact. 

In  the  following  discussion  we  derive  several  IID  mixture  models  for  the  Laplacian  model  class  M. .  For 
this  purpose,  it  will  be  convenient  to  consider  the  corresponding  one-sided  counterpart  of  the  Laplacian, 
which  is  the  exponential  distribution  over  the  absolute  value  of  the  coefficients,  |a|,  and  then  symmetrize 
back  to  obtain  the  final  distribution  over  the  signed  coefficients  a. 

’More  sophisticated  models  which  include  dependencies  between  the  elements  of  a  are  out  of  the  scope  of  this  work. 


P(aj\6)vjj(0)d0, 


(12) 
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A.  The  conjugate  prior 

In  general,  a  closed  form  solution  of  (13)  if  w(9)  is  the  conjugate  prior  of  P(a\9).  When  P(a\9)  is 
an  exponential  (one-sided  Laplacian),  the  conjugate  prior  is  the  Gamma  distribution, 

w{9\k,0)  =  T (Ac)“10K-1/3Ke-/3e,  6  G  R+, 

where  k  and  /?  are  its  shape  and  scale  parameters  respectively.  Plugging  this  in  (13)  we  obtain  the  Mixture 
of  exponentials  model  (MOE),  which  has  the  following  form  (see  Appendix  A  for  the  full  derivation), 

Qmoe(o|/3) h)  =  k/3k(cl  +  /?)  a  G  M+.  (14) 

With  some  abuse  of  notation,  we  will  also  denote  the  symmetric  distribution  on  a  as  MOE, 

QmoeH/3,  k)  =  ^K(3K(\a\  +  /?)_(K+1),  a  <G  M.  (15) 

Although  the  resulting  prior  has  two  parameters  to  deal  with  instead  of  one,  we  know  from  universal 
coding  theory  that,  in  principle,  any  choice  of  k  and  (3  will  give  us  a  model  whose  codelength  regret  is 
asymptotically  small. 

Furthermore,  being  HD  models,  each  coefficient  of  a  itself  is  modeled  as  a  mixture  of  exponentials, 
which  makes  the  resulting  model  over  a  very  well  suited  to  the  most  flexible  scenario  where  the 
“underlying”  6  can  be  different  for  each  ap  In  Section  V-B  we  will  show  that  a  single  MOE  distribution 
can  fit  each  of  the  K  rows  of  A  better  than  K  separate  Laplacian  distributions  fine-tuned  to  these  rows, 
with  a  total  of  K  parameters  to  be  estimated.  Thus,  not  only  we  can  deal  with  one  single  unknown  8, 
but  we  can  actually  achieve  maximum  flexibility  with  only  two  parameters  (n  and  /?).  This  property  is 
particular  of  the  mixture  models,  and  does  not  apply  to  the  other  universal  models  presented. 

Finally,  if  desired,  both  k  and  (3  can  be  easily  estimated  using  the  method  of  moments  (see  Appendix  A). 
Given  sample  estimates  of  the  first  and  second  non-central  moments,  fa  =  4  ,  | a?  |  and  fa  = 

l  |<F/|2,  we  have  that 

k  =  2{fa  -  p\)/{fa  -  2p\)  and  (3  =  {k  -  l)fa.  (16) 

When  the  MOE  prior  is  plugged  into  (5)  instead  of  the  standard  Laplacian,  the  following  new  sparse 
coding  formulation  is  obtained, 

K 

aj  =  argmin  ||xj  -  Da||2  +  AMOe  ^  log  (|«fc|  +  P) ,  (17) 

k= 1 

where  AM0E  =  2(j2(k+1).  An  example  of  the  MOE  regularizer,  and  the  thresholding  function  it  induces,  is 
shown  in  Figure  3  (center  column)  for  k  =  2.5,/?  =  0.05.  Smooth,  differentiable  non-convex  regularizes 
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such  as  the  one  in  in  (17)  have  become  a  mainstream  robust  alternative  to  the  i\  norm  in  statistics  [16], 
[46].  Furthermore,  it  has  been  shown  that  the  use  of  such  regularizers  in  regression  leads  to  consistent 
estimators  which  are  able  to  identify  the  relevant  variables  in  a  regression  model  (oracle  property)  [16]. 
This  is  not  always  the  case  for  the  t\  regularizer,  as  was  proved  in  [46].  The  MOE  regularizer  has  also 
been  recently  proposed  in  the  context  of  compressive  sensing  [7],  where  it  is  conjectured  to  be  better 
than  the  li-term  at  recovering  sparse  signals  in  compressive  sensing  applications.4  This  conjecture  was 
partially  confirmed  recently  for  non-convex  regularizers  of  the  form  =  ||a||  with  0  <  r  <  1  in 
[35],  [18],  and  for  a  more  general  family  of  non-convex  regularizers  including  the  one  in  (17)  in  [42],  In 
all  cases,  it  was  shown  that  the  conditions  on  the  sensing  matrix  (here  D)  can  be  significantly  relaxed 
to  guarantee  exact  recovery  if  non-convex  regularizers  are  used  instead  of  the  l\  norm,  provided  that 
the  exact  solution  to  the  non-convex  optimization  problem  can  be  computed.  In  practice,  this  regularizer 
is  being  used  with  success  in  a  number  of  applications  here  and  in  [8],  [4 1].5  Our  experimental  results 
in  Section  V  provide  further  evidence  on  the  benefits  of  the  use  of  non-convex  regularizers,  leading  to 
a  much  improved  recovery  accuracy  of  sparse  coefficients  compared  to  fi  and  (q,  which  in  turn  yields 
improvements  in  applications  such  as  denoising  and  classification.  We  also  show  in  Section  V  that  the 
MOE  prior  is  much  more  accurate  than  the  standard  Laplacian  to  model  the  distribution  of  reconstruction 
coefficients  drawn  from  a  large  database  of  image  patches. 


B.  The  Jeffreys  prior 


The  Jeffreys  prior  for  a  parametric  model  class  A4  =  {P(a|0),  6  e  0},  is  defined  as 


=  eee, 


SeJUTW 


where  1(9)  is  the  Fisher  information  matrix'. 

m  =  { Emi) 


-|jjlogP(a|9) 

o6z 


(18) 


(19) 


8=0 


The  Jeffreys  prior  is  well  known  in  Bayesian  theory  due  to  three  important  properties:  it  virtually 
eliminates  the  hyper-parameters  of  the  model,  it  is  invariant  to  the  original  parametrization  of  the  distri¬ 
bution,  and  it  is  a  “non-informative  prior,”  meaning  that  it  represents  well  the  lack  of  prior  information 


4In  [7],  the  logarithmic  regularizer  arises  from  approximating  the  pseudo-norm  as  an  tfi-normalized  element-wise  sum, 
without  the  insight  and  theoretical  foundation  here  reported. 

5While  these  works  support  the  use  of  such  non-convex  regularizers,  none  of  them  formally  derives  them  using  the  universal 
coding  framework  as  in  this  paper. 
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on  the  unknown  parameter  6  [4].  It  turns  out  that,  for  quite  different  reasons,  the  Jeffreys  prior  is  also  of 
paramount  importance  in  the  theory  of  universal  coding.  For  instance,  it  has  been  shown  in  [3]  that  the 
worst  case  regret  of  the  mixture  code  obtained  using  the  Jeffreys  prior  approaches  that  of  the  NML  as 
the  number  of  samples  n  grows.  Thus,  by  using  Jeffreys,  one  can  attain  the  minimum  worst  case  regret 
asymptotically,  while  retaining  the  advantages  of  a  mixture  (not  needing  hindsight  of  a),  which  in  our 
case  means  to  be  able  to  use  it  as  a  model  for  computing  a  via  sparse  coding. 

For  the  exponential  distribution  we  have  that  1(9)  =  ik.  Clearly,  if  we  let  0  =  (0,oo),  the  integral 
in  (18)  evaluates  to  oo.  Therefore,  in  order  to  obtain  a  proper  integral,  we  need  to  exclude  0  and  oo 
from  0  (note  that  this  was  not  needed  for  the  conjugate  prior).  We  choose  to  define  0  =  [6i,62], 
0  <  9\  <  d2  <  oo,  leading  to 


w (9)  =  a,  9  G  [61,62]. 


In  (62/6{)6' 

The  resulting  mixture,  after  being  symmetrized  around  0,  has  the  following  form  (see  Appendix  B): 


-  21®  R  (e~‘M  -  e~°M)  ■  “  6  R+  (20) 

We  refer  to  this  prior  as  a  Jeffreys  mixture  of  exponentials  (JOE),  and  again  overload  this  acronym  to 
refer  to  the  symmetric  case  as  well.  Note  that  although  QJOE  is  not  defined  for  a  =  0,  its  limit  when 
a  — >  0  is  finite  and  evaluates  to  9 )  ■  Thus,  by  defining  (5joe(0)  =  2 in (e^/W) ’  we  °htain  a  prior  that 
is  well  defined  and  continuous  for  all  a£l.  When  plugged  into  (5),  we  get  the  JOE-based  sparse  coding 
formulation, 

K 

mm  ||xj  -  DaUg  +  AJ0E  ^{log \ak\  -  log(e_6ll|afc|  -  e~e^a^)},  (21) 

k= 1 

where,  according  to  the  convention  just  defined  for  (5joe(0),  we  define  V,joe(0)  :=  log(92  —  6i).  According 
to  the  MAP  interpretation  we  have  that  AJ0E  =  2a2,  coming  from  the  Gaussian  assumption  on  the 
approximation  error  as  explained  in  Section  II-A. 

As  with  MOE,  the  JOE-based  regularizer,  ip!OE(-)  =  —  log  (5joe(-)>  is  continuous  and  differentiable  in 
M+,  and  its  derivative  converges  to  a  finite  value  at  zero,  lima_>o  ’tpK)¥{a)  =  pfryf-  As  we  will  see  later 
in  Section  IV,  these  properties  are  important  to  guarantee  the  convergence  of  sparse  coding  algorithms 
using  non-convex  priors. 

Note  from  (21)  that  we  can  rewrite  the  JOE  regularizer  as 


V;joe(oa:)  =  log  |afc|  —  loge  0l|al(l-e  (02  0l)|a|) 


6i\ak\  +  log  \ak\  —  log(l  —  e  (02  01)lafcl). 


For  sufficiently  large  |afc|,  log(l  —  e  {°2  6,1)lafcl)  ~  0,  6\\ak\  log  |afc|,  and  we  have  that  if 

JOE 

6i\ak\.  Thus,  for  large  \ak\,  the  JOE  regularizer  behaves  like  l\  with  X'  =  c1g29\.  In  terms  of  the 
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Fig.  3:  Left  to  right:  t\  (green),  MOE  (red)  and  JOE  (blue)  regularizes  and  their  corresponding  thresholding  functions 
thres(:r)  :=  argmina{(a;  —  a)2  +  At/>(|a|)}.  The  unbiasedness  of  MOE  is  due  to  the  fact  that  large  coefficients  are  not  shrank  by 
the  thresholding  function.  Also,  although  the  JOE  regularize!'  is  biased,  the  shrinkage  of  large  coefficients  can  be  much  smaller 
than  the  one  applied  to  small  coefficients. 


probability  model,  this  means  that  the  tails  of  the  JOE  mixture  behave  like  a  Laplacian  with  6  =  6 1, 
with  the  region  where  this  happens  determined  by  the  value  of  62  —  6\.  The  fact  that  the  non-convex 
region  of  (- )  is  confined  to  a  neighborhood  around  0  could  help  to  avoid  falling  in  bad  local  minima 
during  the  optimization  (see  Section  IV  for  more  details  on  the  optimization  aspects).  Finally,  although 
having  Laplacian  tails  means  that  the  estimated  a  will  be  biased  [16],  the  sharper  peak  at  0  allows  us  to 
perform  a  more  aggressive  thresholding  of  small  values,  without  excessively  clipping  large  coefficients, 
which  leads  to  the  typical  over-smoothing  of  signals  recovered  using  an  t\  regularizer.  See  Figure  3 
(rightmost  column)  for  an  example  regularizer  based  on  JOE  with  parameters  6\  =  20,  62  =  100,  and  the 
thresholding  function  it  induces. 

The  JOE  regularizer  has  two  hyper-parameters  (6 1,  O2)  which  define  0  and  that,  in  principle,  need  to  be 
tuned.  One  possibility  is  to  choose  6\  and  6>  based  on  the  physical  properties  of  the  data  to  be  modeled, 
so  that  the  possible  values  of  6  never  fall  outside  of  the  range  [9\ ,  62].  For  example,  in  modeling  patches 
from  grayscale  images  with  a  limited  dynamic  range  of  [0,  255]  in  a  DCT  basis,  the  maximum  variance 
of  the  coefficients  can  never  exceed  1282.  The  same  is  true  for  the  minimum  variance,  since  unprocessed 
images  always  have  a  small  (albeit  unnoticeable)  base  noise. 

Flaving  said  this,  in  practice  it  is  advantageous  to  adjust  \0\ .  6L]  to  the  data  at  hand.  In  this  case, 
although  no  closed  form  solutions  exist  for  estimating  [#1,6*2]  using  maximum  likelihood  or  the  method 
of  moments,  standard  optimization  techniques  can  be  easily  applied  to  obtain  them.  See  Appendix  B  for 
details. 

C.  The  conditional  Jeffreys 

A  recent  approach  to  deal  with  the  case  when  the  integral  over  0  in  the  Jeffreys  prior  is  improper,  is 
the  conditional  Jeffreys  [21,  Chapter  11].  The  idea  is  to  construct  a  proper  prior,  based  on  the  improper 
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Jeffreys  prior  and  the  first  few  no  samples  of  a,  (ai,  02, ,  a„0),  and  then  use  it  for  the  remaining  data. 
The  key  observation  is  that  although  the  normalizing  integral  f  \J  l  (0)d()  in  the  Jeffreys  prior  is  improper, 
the  unnormalized  prior  w{9)  =  y/l(9)  can  be  used  as  a  measure  to  weight  P(a\,  02, ... ,  ano\6), 

P(a1,a2,  •  •  •  ,ano\9)^/l(9) 


w(9)  = 


feP(a1,a2  ;  -  -  -  ;  an0  1  ovmdt' 


(22) 


It  turns  out  that  the  integral  in  (22)  usually  becomes  proper  for  small  no  in  the  order  of  dim(0).  In 
our  case  we  have  that  for  any  no  >  1,  the  resulting  prior  is  a  Gamma(/to,  j3o)  distribution  with  k 0  :=  no 
and  A)  :=  YJl  1  aj  (see  Appendix  C  for  details).  Therefore,  using  the  conditional  Jeffreys  prior  in  the 
mixture  leads  to  a  particular  instance  of  MOE,  which  we  denote  by  CMOE  (although  the  functional  form 
is  identical  to  MOE),  where  the  Gamma  parameters  k  and  3  are  automatically  selected  from  the  data. 
This  may  explain  in  part  why  the  Gamma  prior  performs  so  well  in  practice,  as  we  will  see  in  Section  V. 

Furthermore,  we  observe  that  the  value  of  3  obtained  with  this  approach  (3o)  coincides  with  the  one 
estimated  using  the  method  of  moments  for  MOE  if  the  k  in  MOE  is  fixed  to  k  =  kq  + 1  =  no  +  1 .  Indeed, 
if  computed  from  no  samples,  the  method  of  moments  for  MOE  gives  3  =  (k  —  1)jo,  with  fi\  =  A  Y  a3. 


which  gives  us  3  = 


 n0 +1—1 


n0 


Y  aj  =  3o-  It  turns  out  in  practice  that  the  value  of  k  estimated  using  the 


method  of  moments  gives  a  value  between  2  and  3  for  the  type  of  data  that  we  deal  with  (see  Section  V), 
which  is  just  above  the  minimum  acceptable  value  for  the  CMOE  prior  to  be  defined,  which  is  no  =  1. 
This  justifies  our  choice  of  no  =  2  when  applying  CMOE  in  practice. 

As  no  becomes  large,  so  does  no  =  no,  and  the  Gamma  prior  w(9)  obtained  with  this  method 
converges  to  a  Kronecker  delta  at  the  mean  value  of  the  Gamma  distribution,  SKom0(-)-  Consequently, 
when  w(9)  «  SKom0(9),  the  mixture  fe  P(a\9)w(9)d9  will  be  close  to  P(cl\ko/ 3o)-  Moreover,  from 
the  definition  of  no  and  3o  we  have  that  kq/3o  is  exactly  the  maximum  likelihood  estimator  of  9  for 
the  Laplacian  distribution.  Thus,  for  large  no,  the  conditional  Jeffreys  method  approaches  the  maximum 
likelihood  Laplacian  model. 

Although  from  a  strict  universal  coding  point  of  view  this  is  not  a  problem,  for  large  no  the  conditional 
Jeffreys  model  will  loose  its  flexibility  to  deal  with  the  case  when  different  coefficients  in  A  have  different 
underlying  9.  On  the  other  hand,  using  a  small  no  can  lead  to  a  prior  w(9)  that  is  severely  overfitted 
to  the  local  properties  of  the  first  samples,  which  for  non- stationary  data  such  as  image  patches,  can  be 
problematic.  Ultimately,  no  defines  a  trade-off  between  the  degree  of  flexibility  and  the  accuracy  of  the 
resulting  model. 
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IV.  Optimization  and  implementation  details 

All  of  the  mixture  models  discussed  so  far  yield  non-convex  regularizers,  rendering  the  sparse  coding 
problem  non-convex  in  a.  It  turns  out  however  that  these  regularizers  satisfy  certain  conditions  which  make 
the  resulting  sparse  coding  optimization  well  suited  to  be  approximated  using  a  sequence  of  successive 
convex  sparse  coding  problems,  a  technique  known  as  Local  Linear  Approximation  (LLA)  [47]  (see  also 
[41],  [19]  for  alternative  optimization  techniques  for  such  non-convex  sparse  coding  problems).  In  a 

nutshell,  suppose  we  need  to  obtain  an  approximate  solution  to 

K 

a j  =  argmin  ||xj  —  Da|j  +  A  V’(lafc|);  (23) 

a  k= l 

where  ijj(-)  is  a  non-convex  function  over  M+.  At  each  LLA  iteration,  we  compute  a^  1  1 J  by  doing  a  first 
order  expansion  of  'ip(-)  around  the  K  elements  of  the  current  estimate  ajy , 

^(M)  =  v>(i4ii) +/(i4?i)  (m  -  i4ji)  =i>'(\4]\)\a\  +  ck, 

and  solving  the  convex  weighted  t\  problem  that  results  after  discarding  the  constant  terms  c^, 

K 

a^+1)  =  argmin  ||  x.j  -  D  a^  +  A  ^  4>k\\ak\) 

k= l 

K  K 

=  argmin  ||xj  -  D  a^  +  A  | ) | a* |  =  argmin  ||xj  -  Dafj  +  A^|afc|-  (24) 

k= 1  k= 1 

where  we  have  defined  Ajjf  :=  A^/(|ajy  |).  If  L'(-)  is  continuous  in  (0, +oo),  and  right-continuous  and 
finite  at  0,  then  the  LLA  algorithm  converges  to  a  stationary  point  of  (23)  [46].  These  conditions  arc  met 
for  both  the  MOE  and  JOE  regularizers.  Although,  for  the  JOE  prior,  the  derivative  'L'(-)  is  not  defined  at 
0,  it  converges  to  the  limit  2(L-e1)  w('en  H  which  is  well  defined  for  02  p1  0\.  If  02  =  8i,  the  JOE 
mixing  function  is  a  Kronecker  delta  and  the  prior  becomes  a  Laplacian  with  parameter  9  =  9\  =  02- 
Therefore  we  have  that  for  all  of  the  mixture  models  studied,  the  LLA  method  converges  to  a  stationary 
point.  In  practice,  we  have  observed  that  5  iterations  are  enough  to  converge.  Thus,  the  cost  of  sparse 
coding,  with  the  proposed  non-convex  regularizers,  is  at  most  5  times  that  of  a  single  i\  sparse  coding, 
and  could  be  less  in  practice  if  warm  restarts  are  used  to  begin  each  iteration. 

Of  course  we  need  a  starting  point  a and,  being  a  non-convex  problem,  this  choice  will  influence 
the  approximation  that  we  obtain.  One  reasonable  choice,  used  in  this  work,  is  to  define  ajy  =  ao,  k  = 
1, . . . ,  K,  j  =  1, . . . ,  N,  where  ao  is  a  scalar  so  that  ip'(ao)  =  Ew[9],  that  is,  so  that  the  first  sparse 
coding  corresponds  to  a  Laplacian  regularize!'  whose  parameter  is  the  average  value  of  9  as  given  by  the 
mixing  prior  w(9). 
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Finally,  note  that  although  the  discussion  here  has  revolved  around  the  Lagrangian  or  MAP  formulation 
to  sparse  coding,  this  technique  is  also  applicable  to  the  constrained  formulation  of  sparse-coding  given 
by  Equation  (1)  for  a  fixed  dictionary  D. 

Comments  on  parameter  estimation:  All  the  universal  models  presented  so  far,  with  the  exception 
of  the  conditional  Jeffreys,  depend  on  hyper-parameters  which  in  principle  should  be  tuned  for  optimal 
performance  (remember  that  they  do  not  influence  the  universality  of  the  model).  If  tuning  is  needed,  it 
is  important  to  remember  that  the  proposed  universal  models  are  intended  for  reconstruction  coefficients 
of  clean  data,  and  thus  their  hyper-parameters  should  be  computed  from  statistics  of  clean  data,  or  either 
by  compensating  the  distortion  in  the  statistics  caused  by  noise  (see  for  example  [30]).  Finally,  note 
that  when  D  is  linearly  dependent  and  rank(D)  =  Mm,  the  coefficients  matrix  A  resulting  from  an 
exact  reconstruction  of  X  will  have  many  zeroes  which  are  not  properly  explained  by  any  continuous 
distribution  such  as  a  Laplacian.  We  sidestep  this  issue  by  computing  the  statistics  only  from  the  non-zero 
coefficients  in  A.  Dealing  properly  with  the  case  P(a  =  0)  >  0  is  beyond  the  scope  of  this  work. 

V.  Experimental  results 

In  the  following  experiments,  the  testing  data  X  are  8x8  patches  drawn  from  the  Pascal  VOC2006 
testing  subset,6  which  are  high  quality  640x480  RGB  images  with  8  bits  per  channel.  For  the  experiments, 
we  converted  the  2600  images  to  grayscale  by  averaging  the  channels,  and  scaled  the  dynamic  range  to 
lie  in  the  [0, 1]  interval.  Similar  results  to  those  shown  here  are  also  obtained  for  other  patch  sizes. 

A.  Dictionary  learning 

For  the  experiments  that  follow,  unless  otherwise  stated,  we  use  a  “global”  overcomplete  dictionary 
D  with  K  =  4 M  =  256  atoms  trained  on  the  full  VOC2006  training  subset  using  an  extension  of  the 
model  (3),  where  we  add  an  additional  dictionary  regularization  term  to  the  cost  function,7 

1  N 

n  £  {N  - D  a^ii2 + A^(a,)} + f  iiDTDit  •  (25) 

3= 1 

II  T  1 1 2 

The  additional  term,  /i  |D  D||f,  encourages  incoherence  in  the  learned  dictionary,  that  is,  it  forces  the 
atoms  to  be  as  orthogonal  as  possible.  Dictionaries  with  lower  coherence  are  well  known  to  have  several 

6http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#VOC2006 

7While  we  could  have  used  off-the-shelf  dictionaries  such  as  DCT  in  order  to  test  our  universal  sparse  coding  framework,  it 
is  important  to  use  dictionaries  that  lead  to  the  state-of-the-art  results  in  order  to  show  the  additional  potential  improvement  of 
our  proposed  regularizers. 
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theoretical  advantages  such  as  improved  ability  to  recover  sparse  signals  [1 1],  [40],  and  faster  and  better 
convergence  to  the  solution  of  the  sparse  coding  problems  (1)  and  (3)  [13].  In  particular,  the  technique 
of  imposing  incoherence  using  the  formulation  in  (25)  was  introduced  in  [33],  where  it  was  shown  to 
lead  to  improvements  in  a  variety  of  sparse  modeling  applications,  including  the  ones  discussed  below. 

For  dictionary  learning  we  use  an  regularizer  with  A  =  0.1,  which  is  typical  in  sparse  coding 
applications,  producing  dictionaries  D  that  lead  to  state-of-the-art  results  [1],  [26].  For  the  incoherence 
we  use  p  =  1,  which  has  also  been  observed  to  be  a  good  value  in  practice  for  this  case.  See  [1],  [26], 
[33]  for  details  on  the  optimization  of  (3)  and  (25). 

B.  MOE  as  a  prior  for  sparse  coding  coefficients 

We  begin  by  comparing  the  performance  of  the  Laplacian  and  MOE  models  (priors)  for  fitting  a  single 
global  distribution  to  the  whole  matrix  A.  We  compute  A  using  (1)  with  e  «  0  and  then,  following  the 
discussion  in  Section  IV,  restrict  our  study  to  the  nonzero  elements  of  A. 

The  empirical  distribution  of  A  is  plotted  in  Figure  4(left),  along  with  the  best  fitting  Laplacian,  MOE, 
JOE,  and  a  particularly  good  example  of  the  conditional  Jeffreys  (CMOE)  distributions.8  The  MLE  for  the 
Laplacian  fit  is  9  =  N\  /  1 1 A 1 1 ,  =  27.2  (here  N\  is  the  number  of  nonzero  elements  in  A).  For  MOE, 
using  (16),  we  obtained  k  =  2.8  and  (3  =  0.07.  For  JOE,  8\  =  2.4  and  82  =  371.4.  According  to  the 
discussion  in  Section  III-C,  we  used  the  value  k  =  2.8  obtained  using  the  method  of  moments  for  MOE 
as  a  hint  for  choosing  no  =  2  (no  =  no  +  1  =  3  «  2.8),  yielding  (3q  =  0.07,  which  coincides  with 
the  (3  obtained  using  the  method  of  moments.  As  observed  in  Figure  4(left),  in  all  cases  the  proposed 
mixture  models  fit  the  data  better,  significantly  better  for  both  Gamma-based  mixtures,  MOE  and  CMOE, 
and  slightly  better  for  JOE.  This  is  further  confirmed  by  the  Kullback-Leibler  divergences  (KL)  obtained 
in  each  case.  As  a  reference,  the  empirical  entropy  of  the  quantized  data  is  H{ A)  =  3.00  bits.  Note  that 
JOE  fails  to  significantly  improve  on  the  Laplacian  mode  due  to  the  excessively  large  estimated  range 
[9\,  62].  In  this  sense,  it  is  clear  that  the  JOE  model  is  very  sensitive  to  its  hyper-parameters,  and  a  better 
and  more  robust  estimation  would  be  needed  for  it  to  be  useful  in  practice. 

Given  these  results,  we  concentrate  the  rest  of  the  experiments  on  the  best  (and  simplest)  case  which 
is  the  MOE  prior  (which,  as  detailed  above,  can  be  derived  from  the  conditional  Jeffreys  as  well,  thus 
representing  both  approaches). 

sTo  compute  the  empirical  distribution,  we  quantized  the  elements  of  A  uniformly  in  steps  of  2-8,  which  for  the  amount  of 
data  available,  gives  us  enough  detail  and  at  the  same  time  reliable  statistics  for  all  the  quantized  values. 
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Fig.  4:  Left:  Empirical  distribution  of  the  sparse  coding  coefficients  a  for  image  patches  (blue  dots),  best  fitting  Laplacian 
(green),  MOE  (red),  CMOE  (orange)  and  JOE  (yellow)  distributions.  The  Laplacian  is  clearly  not  fitting  the  tails  properly,  and 
is  not  sufficiently  peaked  at  zero  either.  The  two  models  based  on  a  Gamma  prior,  MOE  and  CMOE,  provide  an  almost  perfect 
fit.  The  fitted  JOE  is  the  most  sharply  peaked  at  0,  but  doest  not  fit  the  tails  as  tight  as  desired.  The  KL  divergences  are  0.171 
bits  for  the  Laplacian,  0.008  for  MOE,  0.009  for  CMOE  and  0.142  for  JOE.  Right:  KL  divergence  for  the  best  fitting  global 
Laplacian  (dark  green),  per-atom  Laplacian  (light  green),  global  MOE  (dark  red)  and  per-atom  MOE  (light  red),  relative  to  the 
KL  divergence  between  the  globally  fitted  MOE  distribution  and  the  empirical  distribution.  The  horizontal  axis  represents  the 
indexes  of  each  atom,  k  =  1  ordered  according  to  the  difference  in  KL  divergence  between  the  global  MOE  and  the 

per-atom  Laplacian  model.  Note  how  the  global  MOE  outperforms  both  the  global  and  per-atom  Laplacian  models  in  all  but  the 
first  4  cases. 


From  Figure  2(right)  we  know  that  the  optimal  6  varies  locally  across  different  regions,  thus,  we  expect 
the  mixture  models  to  perform  well  also  on  a  per-atom  basis.  This  is  confirmed  in  Figure  4(right),  where 
we  show,  for  each  row  a k,k  =  1  the  difference  in  KL  divergence  between  the  globally  fitted 

MOE  distribution  and  the  best  per-atom  fitted  MOE,  the  globally  fitted  Laplacian,  and  the  per-atom  fitted 
Laplacians  respectively.  The  horizontal  axis,  which  represents  atom  index,  is  sorted  by  increasing  KL 
divergence  difference  between  the  per-atom  fitted  Laplacians  and  the  globally  fitted  MOE  distribution.  As 
can  be  observed,  the  KL  obtained  with  the  global  MOE  is  significantly  smaller  than  the  global  Laplacian 
in  all  cases,  and  even  the  per-atom  Laplacians  in  most  of  the  cases.  This  shows  that  MOE,  with  only 
two  parameters  (which  can  be  easily  estimated,  as  detailed  in  the  text),  is  a  much  better  model  than  K 
Laplacians  (requiring  K  critical  parameters)  fitted  specifically  to  the  coefficients  associated  to  each  atom. 
Whether  these  modeling  improvements  have  a  practical  impact  is  explored  in  the  next  experiments. 

C.  Recovery  of  noisy  sparse  signals 

Flere  we  compare  the  active  set  recovery  properties  of  the  MOE  prior,  compared  to  those  of  the  iq  -based 
one,  on  data  for  which  the  sparsity  assumption  \Af  <  L  holds  exactly  for  all  j,  for  a  small  L.  To  this 
end,  we  obtain  sparse  approximations  to  each  sample  Xj  using  the  ('o-based  Orthogonal  Matching  Pursuit 
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Fig.  5:  Left:  active  set  recovery  accuracy  of  i\  and  MOE  for  truly  sparse  A,  as  defined  in  Section  V-C,  for  two  sparsity 
levels  L,  as  a  function  of  the  noise  variance  a.  The  improvement  of  the  proposed  model  over  l\  is  a  factor  of  5  to  9.  Center: 
PSNR  of  the  recovered  signals  with  respect  to  the  clean  signals,  again  for  the  case  where  A  is  truly  sparse.  In  this  case 
significant  improvements  can  be  observed  at  the  low  SNR  range,  specially  for  highly  sparse  (L  =  5)  signals.  The  performance 
of  both  methods  is  practically  the  same  for  a  >  10.  Right:  denoising  of  real  data.  Results  are  relative  to  OMP  using  a  globally 
trained  dictionary.  The  proposed  mixture  model  offers  the  best  of  both  worlds:  good  patch-level  reconstruction  and  final  image 
reconstruction  after  averaging.  While  loosing  to  the  fi-based  dictionary  solution  for  a  =  10  at  the  patch  level,  the  gap  is  closed 
after  averaging.  For  high  noise,  the  improved  robustness  of  our  method  gives  significantly  better  results. 


algorithm  (OMP)  on  D  [28],  and  record  the  resulting  active  sets  Aj  as  ground  truth.  The  data  is  then 
contaminated  with  additive  Gaussian  noise  of  variance  a  and  the  recovery  is  performed  by  solving  (1) 
for  A  with  e  =  CMo 2  and  either  the  l\  or  the  MOE-based  regularizer  for  ?/>(•).  We  use  C  =  1.32,  which 
is  a  standard  value  in  denoising  applications  (see  for  example  [27]). 

For  each  sample  j,  we  measure  the  error  of  each  method  in  recovering  the  active  set  as  the  Flamming 
distance  between  the  true  and  estimated  support  of  the  corresponding  reconstruction  coefficients.  The 
accuracy  of  the  method  is  then  given  as  the  percentage  of  the  samples  for  which  this  error  falls  below  a 
certain  threshold  T.  Results  are  shown  in  Figure  5(left)  for  L  =  (5, 10)  and  T  =  (2,4)  respectively,  for 
various  values  of  cr.  Note  the  very  significant  improvement  obtained  with  the  proposed  model. 

Given  the  estimated  active  set  Aj,  the  estimated  clean  patch  is  obtained  by  projecting  x;/  onto  the 
subspace  defined  by  the  atoms  that  are  active  according  to  Aj,  using  least  squares  (which  is  the  standard 
procedure  for  denoising  once  the  active  set  is  determined).  We  then  measure  the  PSNR  of  the  estimated 
patches  with  respect  to  the  true  ones.  The  results  are  shown  in  Figure  5(center),  again  for  various  values  of 
a.  As  can  be  observed,  the  MOE-based  recovery  is  significantly  better,  specially  in  the  high  SNR  range. 
Notoriously,  the  more  accurate  active  set  recovery  of  MOE  does  not  seem  to  improve  the  denoising 
performance  in  this  case.  Flowever,  as  we  will  see  next,  it  does  make  a  difference  when  denoising  real 
life  signals,  as  well  as  for  classification  tasks. 
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D.  Recovery  of  real  signals  with  simulated  noise 

This  experiment  is  an  analogue  to  the  previous  one,  when  the  data  are  the  original  natural  image 
patches  (without  forcing  exact  sparsity).  Since  for  this  case  the  sparsity  assumption  is  only  approximate, 
and  no  ground  truth  is  available  for  the  active  sets,  we  compare  the  different  methods  in  terms  of  their 
denoising  performance. 

A  critical  strategy  in  image  denoising  is  the  use  of  overlapping  patches,  where  for  each  pixel  in  the 
image  a  patch  is  extracted  with  that  pixel  as  its  center.  The  patches  are  denoised  independently  as  M- 
dimensional  signals  and  then  recombined  into  the  final  denoised  images  by  simple  averaging.  Although 
this  consistently  improves  the  final  result  in  all  cases,  the  improvement  is  very  different  depending  on  the 
method  used  to  denoise  the  individual  patches.  Therefore,  we  now  compare  the  denoising  performance 
of  each  method  at  two  levels:  individual  patches  and  final  image. 

To  denoise  each  image,  the  global  dictionary  learned  as  described  in  Section  V-A  is  further  adapted  to 
the  noisy  image  patches  to  be  processed  for  a  few  iterations  of  the  learning  algorithm,  using  l\  or  MOE 
respectively  plugged  into  (25),  with  a  coherence  penalty  of  /j  =  1.  For  each  of  these  dictionaries,  we 
solve  (2)  with  e  =  CM  a2,  [27],  and  two  different  regularizers:  first,  with  the  corresponding  regularization 
term  used  to  learn  the  dictionary,  and  then  with  -*/’(•)  =  ||  ||0  (using  OMP).  We  do  the  latter  since  it  is  the 
one  that  often  yields  better  denoising  results  in  practice  (see  [1],  [27]  for  examples).  The  results  for  the 
four  combinations  are  summarized  in  Table  I.  For  the  final  image  results,  we  also  include  results  from 
[1]  (K-SVD),  when  available,  as  a  reference  in  the  table,  and  a  graph  corresponding  to  the  average  final 
image  performance  in  Figure  5(right).  We  also  show  results  for  two  sample  images  in  Figure  6. 

For  er  <  20,  at  the  patch  level,  our  method  gives  results  that  are  very  close  to  those  of  the  (learn- 
ing+coding)  (j+OMP  combination,  which  in  turn  are  worse  than  the  t\+l\  solution  by  0.4dB  on  average. 
However,  both  our  method  and  OMP  give  better  results  after  patch  averaging.  This  is  a  well  known 
empirical  effect  which  is  attributed  in  part  to  the  biasedness  of  i\  -regularized  regression.  The  OMP,  on 
the  other  hand,  gives  less  accurate  per-patch  estimations,  but  the  artifacts  it  produces  are  more  “random” 
and  thus  they  are  canceled-out  when  the  patches  are  averaged  to  build  the  final  image.  In  this  sense,  the 
MOE-based  method  offers  the  best  of  both  worlds,  since  it  gives  better  results  than  OMP  at  a  patch  level, 
and  less  biased,  which  helps  in  the  final  result.  The  benefits  become  clearer  as  we  move  into  the  high 
noise  region  a  >  30,  where  we  obtain  significant  improvements  in  all  cases,  which  are  clearly  visible  in 
Figure  6,  whereas  differences  of  less  than  0.3  —  0.5dB  are  hard  to  distinguish  even  by  close  inspection. 
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Fig.  6:  Sample  image  denoising  results.  Top:  Barbara,  a  =  30.  Bottom:  Boats,  a  —  40.  From  left  to  right:  noisy,  ii/OMP, 
£\t£\,  MOE/MOE.  The  reconstruction  obtained  with  the  proposed  model  is  more  accurate,  as  evidenced  by  a  better  reconstruction 
of  the  texture  in  Barbara,  and  sharp  edges  in  Boats,  and  does  not  produce  the  artifacts  seen  in  both  the  t\  and  £o  reconstructions, 
which  appear  as  black/white  speckles  all  over  Barbara,  and  ringing  on  the  edges  in  Boats. 


E.  Classification  with  universal  sparse  models 

In  this  section  we  apply  our  proposed  universal  models  to  a  classification  problem  where  each  sample 
Xj  is  to  be  assigned  a  class  label  yj  =  1, . . . ,  c,  which  serves  as  an  index  to  the  set  of  possible  classes, 
{Ci,C2,...,Cc}-  The  sample  problem  presented  is  the  Graz’02  bike  detection  problem,9  where  each 
pixel  of  each  testing  image  has  to  be  classified  as  either  background  or  as  part  of  a  bike.  We  follow  the 
classification  framework  given  in  [34],  where  a  statistical  model  for  each  class  Cr  is  learned  by  adapting 
a  dictionary  D,  to  the  training  samples  belonging  to  that  class  using  (25).  Given  the  learned  dictionaries, 
the  label  assigned  to  a  sample  Xj  is  given  by  maximum  likelihood,  y}  =  arg  max,;  />(x;|C,),  where  the 
likelihood  of  x?  belonging  to  class  C,  is  given  in  terms  of  its  sparse  code  given  D,,  a*  ,  computed  via  (4). 
Assuming  that  a*  is  unique  for  alii  =  1, . . . ,  c,  we  have  that  P(xj|Cj)  =  P(xj,a*|Dj),  so  that 

yj  =  argmaxP(xj,a*  |Dj)  =  ar g max P  ( x ,  |  a* ,  D,; )  P  ( a)  |  D , ) 
i  J  i  J  J 

=  argmin  |argmin{— log  P(xj|a,  Dj)  —  log  P(a|Dj)}|  .  (26) 

9http://lear.inrialpes.fr/people/marszalek/data/ig02/ 
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a  =  10 
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h  lo 
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lo 

h 

MOE 

lo 

camera 

30.7/34.0  31.6/33.5 

30.5/33.9 

30.8/34.0 

- 

24.2/27.2 

24.5/27.0 
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- 
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24.4/27.9 

- 
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30.7/33.6  31.0/32.9 
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30.7/33.8 
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24.5/26.8 

25.1/28.0 

25.2/28.2 

- 
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30.4/33.4  30.6/33.1 

30.3/33.4 

30.5/33.5 

- 

25.2/27.8 

24.7/27.6 

25.3/28.1 

25.4/28.2 

- 

lena 

32.3/35.3  32.5/34.7 

32.0/35.5 

32.2/35.5 

35.5 

26.4/29.5 

26.2/29.1 

26.5/30.0 

26.6/30.2 

- 

peppers 

31.9/34.8  32.1/34.4 

31.7/34.8 

31.9/34.8 
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26.3/28.9 

26.5/29.7 

26.6/29.8 

- 

AVER. 

31.1/34.2  31.5/33.7 

30.8/34.2 

31.1/34.3 

- 

24.8/27.5 

24.7/27.2 

25.2/28.4 

25.4/28.6 
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a  =  20 

a  =  40 

learning 

h 

MOE 

[1] 

h 

MOE 

[1] 

coding 

i\  Iq 

MOE 

lo 

h 

to 

MOE 

lo 

camera 

26.7/29.6  27.2/29.6 

26.6/29.9 

26.8/30.0 

- 

21.7/24.7 

22.6/25.3 

22.8/25.6 

23.0/25.8 

- 

barbara 

26.8/30.4  27.2/30.3 

26.5/30.4 

26.8/30.6 

30.8 

21.6/24.3 

21.6/24.5 

22.6/25.5 

22.7/25.7 

- 

boat 

27.1/30.1  27.0/29.9 

27.0/30.2 

27.2/30.3 

30.4 

22.6/25.5 

22.6/25.6 

23.6/26.3 

23.6/26.4 

- 

goldhil 

27.1/29.4  26.8/28.8 

27.1/30.0 

27.2/30.1 

- 

23.7/26.2 

23.4/26.2 

24.2/26.7 

24.2/26.7 

- 

lena 

28.7/32.1  28.6/31.5 

28.5/32.2 

28.7/32.3 

32.4 

24.3/27.6 

24.2/27.6 

25.0/28.2 

25.0/28.3 

- 

peppers 

28.8/31.9  28.7/31.6 

28.5/32.0 

28.7/32.0 

30.8 

23.5/26.7 

23.8/27.0 

24.8/27.8 

24.9/27.9 

- 

AVER. 

27.5/30.5  27.5/30.2 

27.3/30.7 

27.5/30.8 

- 

22.8/25.7 

22.9/25.9 

23.7/26.6 

23.8/26.7 

- 

TABLE  I:  Denoising  results:  in  each  table,  each  column  shows  the  denoising  performance  of  a  learning+coding  combination. 
Results  are  shown  in  pairs,  where  the  left  number  is  the  PSNR  between  the  clean  and  recovered  individual  patches,  and  the  right 
number  is  the  PSNR  between  the  clean  and  recovered  images.  Best  results  are  in  bold.  Our  method  gives  the  best  results  in  all 
cases  at  both  the  patch  and  image  levels  for  a  >  30,  on  average  for  a  =  20,  and  in  most  cases  at  the  image  level  for  a  =  10. 


Since  the  decision  depends  on  the  measured  likelihoods,  it  is  clear  that  the  overall  detection  scheme 
would  benefit  from  accurate  models  for  P(x|a,  Dj)  and  P(a|Dj).  The  idea  is  then  to  use  a  universal 
model  for  /J(a|D(),  as  here  proposed,  instead  of  the  implied  Laplacian  model  used  in  [34], 

In  the  Graz’ 02  dataset,  each  of  the  pixels  can  belong  to  one  of  two  classes:  bike  or  background.  On 
each  of  the  training  images  (which  by  convention  are  the  first  150  even-numbered  images),  we  are  given 
a  mask  that  tells  us  whether  each  pixel  belongs  to  a  bike  or  to  the  background.  We  then  train  a  dictionary 
for  bike  patches  and  another  for  background  patches.  Patches  that  contain  pixels  from  both  classes  are 
assigned  to  the  class  corresponding  to  the  majority  of  their  pixels. 

In  Figure  7  we  show  the  precision  vs.  recall  cun>es  obtained  with  the  detection  framework  when  either 
the  £\  or  the  MOE  regularizers  are  used  for  learning  D  using  (25),  and  then  computing  77,  based  on 
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Fig.  7:  Classification  results.  Left  to  right:  precision  vs.  recall  curve,  a  sample  image  from  the  Graz’02  dataset,  its  ground 
truth,  and  the  corresponding  estimated  maps  obtained  with  l\  and  MOE  for  a  fixed  threshold.  The  precision  vs.  recall  curve 
shows  that  the  mixture  model  gives  a  better  precision  in  all  cases.  In  the  example,  the  classification  obtained  with  MOE  yields 
less  false  positives  and  more  true  positives  than  the  one  obtained  with  i\. 


the  respective  learned  dictionary  D.  The  parameters  for  the  t\  prior  (A),  the  MOE  model  (AM0E)  and  the 
incoherence  term  (p)  were  all  adjusted  by  cross  validation.  The  only  exception  is  the  MOE  parameter  3, 
which  was  chosen  based  on  the  fitting  experiment  as  3  =  0.07.  As  can  be  seen,  the  MOE-based  model 
outperforms  the  t\  in  this  classification  task  as  well,  giving  a  better  precision  for  all  recall  values. 

VI.  Concluding  remarks 

A  framework  for  designing  sparse  modeling  priors  was  introduced  in  this  work,  using  tools  from 
universal  coding.  The  priors  obtained  lead  to  models  with  both  theoretical  and  practical  advantages  over 
the  traditional  Iq  and  (] -based  ones.  In  all  derived  cases,  the  designed  non-convex  problems  are  suitable  to 
be  efficiently  (approximately)  solved  via  a  few  iterations  of  (weighted)  /:)  subproblems.  We  also  showed 
that  these  priors  are  able  to  fit  the  empirical  distribution  of  sparse  codes  of  image  patches  significantly 
better  than  the  traditional  HD  Laplacian  model,  and  even  the  non-identically  distributed  independent 
Laplacian  model  where  a  different  Laplacian  parameter  is  adjusted  to  the  coefficients  associated  to  each 
atom,  thus  showing  the  flexibility  and  accuracy  of  these  proposed  models.  The  additional  flexibility, 
furthermore,  comes  at  a  small  cost  of  only  2  parameters  to  be  easily  tuned  (either  (k.  3)  in  the  MOE 
model,  or  (t9i ,  6*2)  in  the  JOE  model),  instead  of  K  (dictionary  size),  as  in  weighted  Lasso  models.  The 
additional  accuracy  of  the  proposed  models  was  shown  to  have  significant  practical  impact  in  active  set 
recovery  of  sparse  signals,  image  denoising,  and  classification  applications.  Compared  to  the  Bayesian 
approach,  we  avoid  the  potential  burden  of  solving  several  sampled  sparse  problems,  or  being  forced  to 
use  a  conjugate  prior  for  computational  reasons  (although  in  our  case,  a  fortiori,  the  conjugate  prior  does 
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provide  us  with  a  good  model).  Finally,  for  the  presented  models,  parameter  estimation  is  straightforward 
to  implement  and  very  fast  to  compute.  Overall,  as  demonstrated  in  this  paper,  the  introduction  of 
information  theory  tools  can  lead  to  formally  addressing  critical  aspects  of  sparse  modeling. 

Future  work  in  this  direction  includes  the  design  of  priors  that  take  into  account  the  nonzero  mass 
at  a  =  0  that  appears  in  overcomplete  models,  and  online  learning  of  the  model  parameters  from  noisy 
data,  following  for  example  the  technique  in  [30].  We  also  aim  at  applying  this  framework  for  model 
selection  via  the  MDL  principle. 
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Appendix 


A.  Derivation  of  the  MOE  model 
For  the  MOE  case  we  have 


1 


P(a\6)  =  6e~ea ,  and  w(0\k,P)  =  ——6*-1pKe-p0, 

F(«) 

which,  when  plugged  into  (13),  gives 

roc  1  ok  roc 

Q(a\P,  n)  =  /  6e-0a—-eK-1pKe-/3ed6  =  ,  ,  /  e~^a+^0Kd9. 

Je= o  r(«)  r(/c)  Je= o 

After  the  change  of  variables  u  :=  (a  +  (3)9  (u(0)  =  0,  u( oo)  =  oo),  the  integral  can  be  written  as 

k 


Q{a\(3,n)  = 


(3* 


r 

Je= 


r(«)  Je=* 
P 


u 


du 


CL  ~\~  jd  J  CL  (3 


ok  roc 

^(a  +  (3)~^  /  e~uukdu 
1  Je= o 


r(K)(a  +  ^)-(K+1)r(K+l)  =  J?-(a  +  (3)-^K  r(«), 


obtaining  Q(a\P,  k)  =  k (3K(a  +  f3)~(K+1\  since  the  integral  on  the  second  line  is  precisely  the  definition 
of  r(«  +  1).  The  symmetrization  is  obtained  by  substituting  a  by  |a|  and  dividing  the  normalization 
constant  by  two,  Q(\a\\/3,  k)  =  Q.hn(3K(\a\  +  /?)_^K+1). 


I0http://www.di. ens.fr/willow/SPAMS/ 
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The  mean  of  the  MOE  distribution  (which  is  defined  only  for  k  >  1)  can  be  easily  computed  using 
integration  by  parts, 


h((3,k)  =  k(3 k  [ 
Jo 


0  (rt  +  /?)(K+1) 


du  =  k/3 


=  (5K[- 


(k  —  l)(it  +  /3)(fc  *7 


k(u  +  py 
p 

K  —  1 


00  I  r  du 

0  +  «  io  («  +  /?)fc 


In  the  same  way,  it  is  easy  to  see  that  the  non-central  moments  of  order  i  are  /i,  = 


 /3 


("71)  ’ 


The  MLE  estimates  of  k  and  l3  can  be  obtained  using  any  nonlinear  optimization  technique  such  as 
Newton  method,  using  for  example  the  estimates  obtained  with  the  method  of  moments  as  a  starting  point. 
In  practice,  however,  we  have  not  observed  any  significant  improvement  in  using  the  MLE  estimates  over 
the  moments-based  ones. 


B.  Derivation  of  the  constrained  Jeffreys  (JOE)  model 

In  the  case  of  the  exponential  distribution,  the  Fisher  Information  Matrix  in  (19)  evaluates  to 
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By  plugging  this  result  into  (18)  with  0  =  [#1,6*2],  0  <  #1  <  #2  <  00  we  obtain  w(6)  =  g- 

We  now  derive  the  (one-sided)  JOE  probability  density  function  by  plugging  this  w(0)  in  (13), 
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Although  Q(a)  cannot  be  evaluated  at  a  =  0,  the  limit  for  a  — ►  0  exists  and  is  finite,  so  we  can  just 
define  Q(0)  as  this  limit,  which  is 
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Again,  if  desired,  parameter  estimation  can  be  done  for  example  using  maximum  likelihood  (via 
nonlinear  optimization),  or  using  the  method  of  moments.  However,  in  this  case,  the  method  of  moments 
does  not  provide  a  closed  form  solution  for  (6*1 ,  #2)  -  The  non-central  moments  of  order  i  are 
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For  i  =  1,  both  integrals  in  (27)  are  trivially  evaluated,  yielding  m  =  (0\  1  —  $2  '  )■  F°r  *  >  1. 

these  integrals  can  be  solved  using  integration  by  parts: 
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where  the  first  term  in  the  right  hand  side  of  both  equations  evaluates  to  0  for  i  >  1.  Therefore,  for  i  >  1 
we  obtain  the  recursions 

i—1  i  i—1 


Ht  = 
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which,  combined  with  the  result  for  i  =  1,  gives  the  final  expression  for  all  the  moments  of  order  i  >  0 
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In  particular,  for  i  =  1  and  i  =  2  we  have 

0!  =  (ln(02/0i)m  +  0-1)-1  ,  9-2  =  (H02/ei)H2  +  Of2)'1 , 


which,  when  combined,  give  us 
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(28) 


One  possibility  is  to  solve  the  nonlinear  equation  02/0i  =  in(92/fli)/P  for  u  =  Oi /02  by  finding  the 


_  fjl2+\nufif 

fj.2— Inufij 

Another  possibility  is  to  simply  fix  the  ratio  92/9\  beforehand  and  solve  for  9\  and  92  using  (28). 


roots  of  the  nonlinear  equation  u  =  and  choosing  one  of  them  based  on  some  side  information. 


C.  Derivation  of  the  conditional  Jeffreys  (CMOE)  model 

The  conditional  Jeffreys  method  defines  a  proper  prior  w{6)  by  assuming  that  uq  samples  from  the  data 
to  be  modeled  a  were  already  observed.  Plugging  the  Fisher  information  for  the  exponential  distribution, 
1(9)  =  9~2,  into  (22)  we  obtain 
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Denoting  So  =  J2j=  \  o?  and  performing  the  change  of  variables  u  :=  S(f  we  obtain 

Qn0-ie-s0o  SQ09n°~1e~s°9 
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where  the  last  equation  derives  from  the  definition  of  the  Gamma  function,  T(no).  We  see  that  the 
resulting  prior  w(9)  is  a  Gamma  distribution  Gamma(KQ,  To)  with  kq  =  no  and  So  =  Sq  =  aj- 
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